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Abstract 

Sequential Monte Carlo (SMC) methods, also known as particle filters, are simulation-based recursive 
algorithms for the approximation of the a posteriori probability measures generated by state-space 
dynamical models. At any given time a particle filter produces a set of samples over the state space 
of the system of interest. These samples are referred to as "particles" and can be used to build a 
discrete approximation of the a posteriori probability distribution of the state, conditional on a sequence 
of available observations, at time t. When new observations are collected, a recursive stochastic procedure 
allows to update the set of particles to obtain an approximation of the new posterior at time t -f 1 . 

One potential use of this particle set is to approximate the density associated to the a posteriori 
distribution. While practitioners have rather freely applied such density approximations in the past, the 
issue has received less attention from a theoretical perspective. In this paper, we address the problem of 
constructing kernel-based estimates of the filtering probability density function. Kernel methods are the 
most widely employed techniques for nonparametric density estimation and it seems natural to analyze 
their performance when applied to the approximate samples produced by particle filters. Here, we show 
how to obtain asymptotic convergence results for the particle-kernel approximations of the filtering density 
and its derivatives. In particular, we find convergence rates for the approximation error that hold uniformly 
on the state space and guarantee that the error vanishes almost surely as the number of particles in the 
filter grows. Based on this uniform convergence result, we first show how to build continuous measures that 
converge almost surely (with known rate) toward the filtering measure and then address a few applications. 
The latter include maximum a posteriori estimation of the system state using the approximate derivatives 
of the posterior density and the approximation of functionals of it, e.g.. Shannon's entropy. 
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1 Introduction 
1.1 Background 

Consider two random sequences, {Xt}t>o a-nd {Yt}t>i, possibly multidimensional, where Xt represents 
the unobserved state of a system of interest and Yt is a related observation. The aim of stochastic filtering 
methods is to perform inference on the state Xf given the observations Yi , . . . , It . Very often, the specific 
relationship between the two sequences is given by a Markov state-space model that describes 

- the conditional probability distribution of Xt given Xt-i and 

- the conditional probability distribution of the observation Yt given the state Xt . 

The probability measure that characterizes the random variable Xt conditional on the observations 
{1^5,1 < s < t} is usually termed the "filtering measure" and we denote it as tt^ in the sequel. If 
the state-space model is linear and Gaussian, ttj is also Gaussian and can be computed exactly using a set 
of recursive equations known as the Kalman filter |32| . However, if Xt takes values in a continuous space 
and the model is nonlinear or non-Gaussian, the exact filter is intractable and numerical approximation 
techniques are necessary. The class of sequential Monte Carlo (SMC) methods, also known as particle 
filters [26l [33l [36l [211 [20| , has become a very popular tool for this purpose. Particle filters generate discrete 
random measures (constructed from random samples in the state space) that can be naturally used to 
approximate integrals with respect to (w.r.t.) the filtering measure. 

The asymptotic convergence of SMC algorithms has been well studied during the past two decades. 
The first formal results appeared in [T^ [T3], while the analysis in [5] already took into account the 
branching (resampling) step indispensable in most practical applications. Currently, there is a broad 
knowledge about the convergence of particle filters in some of the forms commonly used by practitioners; 
see [16l [21 [Ml [35l m [29] and the references therein. Most of these results, however, are aimed to show that 
integrals of real functions w.r.t. tt^ can be accurately approximated by weighted sums when the particle 
filter is run with a sufficiently large number of random samples (commonly referred to as particles). 
More recently, other types of convergence have been investigated. For instance, the convergence of 
particle approximations of maximum a posteriori (MAP) estimates of sequences has also been proved. 
Convergence in probability can be shown using random genealogical trees (see |41j and [14j ) while almost 
sure convergence can also be guaranteed by extending the analysis in [Hj (see |39|). 

In most cases of interest, the filtering measure has a density, denoted pt, w.r.t. a dominating measure 
(usually Lebesgue's) and practitioners have freely used various estimators of this function. Less attention 
has been devoted to this problem from a theoretical perspective, though. Note that the samples generated 
by the particle filter are not drawn directly from pt: they can only be considered as approximate samples, 
in the sense that they can be used to estimate the value of integrals with respect to the measure tt^ . As a 
consequence, the convergence of a kernel density estimate of pt built from the output of a particle filter 
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cannot be justified directly using the classical theory of kernel density estimation, which is concerned with 
samples drawn directly from the distribution of interest (see, e.g., [T71 SH HH |33] ) . 

The estimation of pt is of interest by itself, since it naturally enables the computation of confidence 
regions, as well as MAP and maximum likelihood estimators, but also because it leads to the approximation 
of TTj by a continuous (instead of discrete) random measure. The convergence of such continuous 
approximations of the filtering measure in total variation distance have been investigated in the context 
of regularized particle filters [35] as well as for accept /reject and auxiliary particle filters [34]. In 
|35) . two variations of the standard particle (or 'bootstrap' [IHl [H]) filter, the pre-regularized and the 
post-regularized particle filters, are analyzed. The two algorithms involve the construction of kernel 
approximations of the filtering density, which are used to draw new particles and mitigate the sample 
impoverishment in the resampling step. The density estimates obtained in the pre-regularized filter are 
unnormalized, since the associated proportionality constants cannot be evaluated. Also, the sampling 
scheme in both cases is more involved than in the standard particle filters. In the pre-regularized method, 
a rejection sampler has to be designed (see also [40]) while in the post- regularized algorithm a mixture 
density has to be sampled. Under a number of regularity assumptions on the transition kernel {Xt}t>o 
and the conditional density of Yt given Xt, upper bounds for the expected value of the total variation 
distance between tt^ and its continuous approximations are shown in |35l Section 6] . 

The analysis in |34j targets the convergence of a sequence of continuous random approximations of 
TTj which are similar to those of the pre-regularized particle filter in |35| . The main difference is the 
use of the Markov kernel of the process {Xt}t>Q for smoothing instead of a generic kernel function as 
in classical density estimation methods. Under adequate regularity assumptions, the convergence in 
probability of the total variation distance (between tt^ and its smooth random approximations) is proved, 
and exponential convergence rates are obtained. The assumptions on which the analysis relies exclude 
some typical transition kernels (e.g., Gaussian) for the process {Xt}t>o- Also, similar to the prc-rcgularized 
particle filtering approach, the filtering density estimates can only be computed up to a proportionality 
constant. 

1.2 Contributions 

In this paper, we analyze the approximation of pt constructed as the sum of properly scaled kernel 
functions located at the particle positions. Kernel methods are the most widely used techniques for the 
non-parametric estimation of probability density functions (pdf's) and, therefore, it seems natural to 
analyze their convergence when applied to the approximate samples generated by particle filters. 

The pdf estimators we analyze are based on generic kernel functions which are only required to satisfy 
mild standard conditions (essentially the same as in classical density estimation theory [32] ) ■ We describe 
how to build approximations in arbitrary-dimensional spaces R'', d > 1, and then analyze their convergence 
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as the number of particles is increased and the bandwidth of the kernels is decreased. In particular, we 
obtain point-wise convergence rates for the absolute approximation errors, both of pt and its dcrivativej^ 
(provided they exist). The latter results can be extended to deduce uniform (instead of point-wise) 
convergence rates, again both for pt and its derivatives. Specifically, we provide explicit bounds for the 
supremum of the approximation error and prove that it converges almost surely (a.s.) toward as the 
number of particles is increased. Our analysis is different from the standard methods in kernel density 
estimation. The latter address the bias and variance of the estimators using approximations based on 
Taylor series (see, e.g., [121 Chapter 4] or Chapter 4]) or Edgeworth expansions [22, which enable the 
asymptotic approximation of the mean integrated square error (MISE) of the density estimate and yield 
expressions involving the number of samples and the kernel bandwidth. We directly obtain convergence 
rates for various estimation errors (not only the MISE), given in terms of a single index that links the 
number of samples and the kernel bandwidth (this link is briefly discussed in Section [3. 3p . 

The uniform (on the support of pt) convergence result can be exploited in a number of ways. For 
instance, if wc let p^ be approximation of pt with N particles, then we can obtain a continuous 
approximation of the filtering measure 7Tt{dx) as tt^ [dx) = p^{x)dx, prove that converges to ttj 
a.s. in total variation distance (as N — > oo) and provide explicit convergence rates. A similar kind of 
analysis also leads to the calculation of convergence rates for the MISE of the particle-kernel density 
estimator p^ . Additionally, we prove that the (random) integrated square error (ISE) of a truncated 
version of p^ converges to a.s. and provide convergence rates. A brief comparison of these results with 
the standard asymptotic approximation of the MISE for kernel estimators built from i.i.d. samples is 
presented at the end of Section l473l 

The convergence in total variation distance of a continuous approximation of the filtering measure tt^ 
was also addressed in [35| and |34| . as mentioned in Scction fl.il Compared to these earlier contributions, 
our analysis guarantees the convergence of the (random) total variation distance a.s. (and with explicit 
rates) rather than the convergence of its expected value (as in [35| ) or its convergence in probability (as 
in [M]). Also, our assumptions on the Markov kernel of the state process {Xt}t>o and the conditional 
densities of {yt|Xt}t>i are relatively mild and simple to check. In particular, our results also hold for 
light-tailed Markov kernels (e.g., Gaussian), unlike Theorems 2 and 3 in [34] . 

The last part of the paper is devoted to some applications of the density approximation method and 
the uniform convergence result. We first consider the problem of MAP estimation. We refer here to the 
maximization of the filtering density, a problem different from that of MAP estimation in the path space 
addressed, e.g., in [23 UU [31] . We first prove that the maxima of the approximation of the filtering density 

^Let us note here that the approximation of derivatives of the filter has received attention recently, related to problems 
of parameter estimation in state-space systems 115] . In the latter context, the filtering pdf is made to depend explicitly 
on a parameter vector 6 = {9i, . . . ,0d), and the interest is in the computation of the partial derivatives dpt/dOi in order to 
implement, e.g., maximum likelihood estimation algorithms [TS]. In this paper, we consider derivatives with respect to the state 
variables in Xt = (^i,t, . . . ,Xd^,t), i.e., dpt/dxi^t- 
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actually converge, asymptotically, to the maxima of the true function pt and then show some simulation 
results that illustrate the use of gradient and Newton search algorithms on the estimated density function. 
Note that the latter algorithms can be "safely" used after we have proved the approximations of the first- 
order and second-order derivates of pt to be converge. 

The second application we describe is the approximation of functionals of pt- We provide first a generic 
result that guarantees the almost sure convergence of such approximations for bounded and Lipschitz 
continuous functionals. Then, we address the problem of approximating Shannon entropies [6], which 
is of practical interest in various machine learning and signal processing problems. The log function is 
neither bounded nor Lipschitz continuous and, therefore, the latter generic result does not apply to the 
computation of entropies. We specifically address this problem by proving first the convergence of the 
particle approximations of integrals of the form J f[x)'Kt{dx) when the test function / is unbounded. Let 
us remark that a large majority of the results in the literature [TBI E [3 |35l [2] refer exclusively to the 
approximation of integrals of bounded functions. Only recently, the convergence of approximate integrals 
of unbounded test functions has been proved [30] , albeit for a modified particle filter and assuming that the 
product of the test and the likelihood functions is bounded. Here, we prove the almost sure convergence 
of the approximations of integrals of unbounded functions for the standard particle filter, assuming only 
integrability of the test function. From this result, we finally deduce the almost sure convergence toward 
of the errors in the approximation of Shannon entropies for densities with a compact support. A numerical 
illustration is given. 

1.3 Organization of the paper 

The rest of the paper is organized as follows. Section [5] contains background material, including a 
summary of notation, a description of Markov state space models and the standard particle [bootstrap) 
filter. A new lemma that establishes the convergence of the particle approximation of posterior 
expectations of unbounded test functions is also introduced in Section [21 The construction of particle- 
kernel approximations of the filtering density and its derivatives is described in Section [31 Our formal 
results on the convergence of the particle-kernel density estimators and the smooth approximation of the 
filtering measure are introduced in Section [31 This includes the point-wise and uniform approximations 
of pt(x), the convergence in total variation distance of the smooth measures tt^ and convergence rates 
for the mean integrated square error and the (random) integrated square error of pf and its truncated 
version, respectively. In Section [51 we discuss applications of the particle-kernel estimator of pt and its 
derivatives. In particular, we consider the problem of (marginal) MAP estimation of the state variables 
and the approximation of functionals of the filtering density, including Shannon's entropy. Finally, brief 
conclusions are presented in Section [51 
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2 Particle filtering 
2.1 Notation 

We first introduce some common notations to be used through the paper, broadly classified by topics. 

d times 

Below, M denotes the real line, while for an integer (i>l, M'' = Kx...xR 

• Measures and integrals. 

— B(R'^) is the cr-algebra of Borel subsets of W^. 

— 'P{R'^) is the set of probability measures over B{R'^). 

— (/, 1^) ~ J f{x)ii{dx) is the integral of a real function f -.W^ ^ M. w.r.t. a measure /i e V(R'^). 

— Take a measure fi e V{«.'^), a Borel set A € B{R'^) and a real function f : W'^ ^ M+. The 
projective product / ★ is a measure, absolutely continuous w.r.t. fi and proportional to /, 
constructed as 

if*,m = ^^j^. (1) 

• Functions. 

— The suprcmum norm of a real function / : R'^ ^ R is denoted as jl/|loo = sup^^j^d |/(x)|. 

— B{M.'^) is the set of bounded real functions over R'', i.e., / e B{R'^) if, and only if, ||/||oo < oo. 

— Cb{R'^) is the set of continuous and bounded real functions over R''. 

• Sets. 

— Given a probability measure /i G 'P{W^), a Borel set A E B{R'^) and the indicator function 

r 1, if a; e A 
\ 0, otherwise ' 

^{A) — {Ia, ^i) — J lA{x)^.{dx) — ^{dx) is the probability of A. 

— The Lebesgue measure of a set A e B{R'^) is denoted C{A). 

— For a set A g R'', A'' = R''\A denotes its complement. 

• Sequences, vectors and random variables (r.v.). 

— We use a subscript notation for sequences, namely Xti:t2 — {^ti, ■ ■ ■ ,Xt2}- 

— For an element x = {xi, . . . ,Xd) G R"* of an Euclidean space, its norm is denoted as 



— The Lp norm of a real r.v. Z, with p> I, is written as \\Z\\p = £'[|Z|'']^/p, where E[-] denotes 
expectation. 
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2.2 Filtering in discrete-time, state-space Markov models 

Consider two random sequences, {Xt}t>o and {Yt}t>i, taking values in R"*^ and K'^" , respectively. There is 
a common probability measure for the pair ({Xt}t>Oi {Yt}t>i), denoted P, that we assume to be absolutely 
continuous w.r.t. the Lebesgue measure. We refer to the first sequence as the state process and we assume 
that it is an inhomogeneous Markov chain governed by an initial probability measure ro G ViM.'^") and a 
sequence of transition kernels Tt : S(R''^) x M.'^"^ — > [0, 1], defined as 

TtiA\xt^i) ^P{Xt€ A\Xt^i = xt-i} , (2) 

where A e 6(R'^-) is a Borel set. The sequence {Yt}t>i is termed the observation process. Each r.v. Yt 
is assumed to be conditionally independent of other observations given the state Xt , meaning that 

P {Yt e A\Xo:t - xo:t, {Yk = VkjkM = P{Yte A\Xt = , (3) 

for any A e B(R'^y). Additionally, we assume that every probability measure 74 S 7^(R'''') in the sequence 

jt{A\xt) = P {Yt e A\Xt = Xt} , AGBiR"^), i=l,2,..., (4) 

has a positive density w.r.t. the Lebesgue measure. We denote this density as gt{y\x), hence we write 
jt{A\xt) = J^gt{y\xt)dy. 

The filtering problem consists in the computation of the posterior probability measure of the state Xt 
given a sequence of observations up to time t. Specifically, for a fixed observation record Yi-^ — yi.t, we 
seek the measures irt & V (R'*'" ) given by 

TTtiA) ^ P {Xt € A\Yi.,t = yi-.t} , t = 0,l,..., (5) 

where A € B{M.'^''). For many practical problems, the interest actually lies in the computation of integrals 
of the form (/, TTt). Note that, for t ~ 0, we recover the prior measure, i.e., ttq = tq. 

2.3 Particle filters 

The sequence of measures {T:t}t>i can be numerically approximated using particle filtering. Particle filters 
are numerical methods based on the recursive decomposition [2] 

TTt ^ gT *m^t-i, (6) 

where : R'^^ M+ is the function defined as g\^{x) = gt{yt\x)^ * denotes the projective product and 
= TtTTt-i is the (predictive) probability measure 

6(A) = rtT:t-i{A) = J Tt{A\x)TTt-i{dx), A e BiR"^). (7) 

Specifically, the simplest particle filter, often called 'standard particle filter' or 'bootstrap filter' (see 
also [H]), can be described as follows. 
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1. Initialization. At time t = draw N i.i.d. samples from the initial distribution tq — ttq, denoted 

71= l,...,iV. 

2. Recursive step. Let = {x[^l'i}n=i,...,N be the particles (samples) generated at time t — I. At 
time t, proceed with the two steps below. 

(a) For n = 1, A^, draw a sample from the probability distribution Tt(-|a;|"\) and compute 
the normalized weight 

„„(«) _ 5f (^i"^) (.g^ 



(b) For n = 1,..., A^, let = if ^ with probability \ k€ {!,..., N}. 

Step 2.(b) is referred to as resampling or selection. In the form stated here, it reduces to the so-called 
multinomial resampling algorithm j211 118) but the convergence of the algorithm can be easily proved 
for various other schemes (see, e.g., the treatment of the resampling step in |7|). Using the samples in 
fl^ — {x["'^}n=i,....N, we construct a random approximation of irt, namely 

1 ^ 

^f(d2:t) = -5]<5,(.,(dxt), (9) 

n=l 

where S (n) is the delta unit-measure located at X-t = . For any integrable function / in the state 
space, it is straightforward to approximate the integral (/, ttj) as 

(/,7r,)«(/,7rf) = ^^/(xi"^ (10) 

n = l 



The convergence of particle filters has been analyzed in a number of different ways. Here we only use 
results for the convergence of the Lp norm of the approximation errors (/, irf) — (/, ttj). 

Proposition 1 Assume that the sequence of observations Yi-t = yi-.T is fixed (with T being some large 
but finite time horizon), g\* G i3(M''=") and {gt*,£,t) > for every t = 1,2,...,T. The following results 
hold. 

(a) For all f e B(R'^-) and any p>l, 

||(/,7rf)-(/,.0||,<^^ (11) 



for t ~ 0,1,..., T, where Ct is a constant independent of N, ||/||oo = sup^^gRd^. |/(x)| and 
the expectation is taken over all possible realizations of the random measure tt^ . In particular, 
limjv^oo |(/,7rf ) - (/,7rt)| = a.s. forO<t<T. 
(b) Let f : W^'' — > M &e possibly unbounded and define ft+i{x) = j f{x')Tt+i{dx'\x). If f and ft+i 
are p-integrable (with p > 2) w.r.t. the measure TTt, i-C, if {fP^'Kt) < oo and [ffj^-^jiTt) < oo for 
<t<T, then 

lim |(/,7rf)-(/,^t)hO a.s. 
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forO<t<T. 
See Appendix for a proof. 

Remark 1 Part (a) of Proposition]^ is fairly standard. A similar proposition was already proved in U6T/ . 
albeit under additional assumptions on the state-space model. Bounds for p ~ 2 and p — 4 can also be 
found in a number of references (see, e.g., \1(A fl^ )- Part (b) establishes the almost sure convergence 
for the approximate integrals of unbounded functions (e.g., for the approximation of the posterior mean). 
A similar result can be found in fSOf . including convergence rates. However, the analysis in \30f is carried 
out for a modified particle filtering algorithm, that involves a rejection test on the generated particles, and 
cannot be applied to the standard particle filter presented in this section. 

3 Particle-kernel approximation of the filtering density 

In the sequel we will be concerned with the family of Markov state-space models for which the posterior 
probability measures {7rt}i>i arc absolutely continuous w.r.t. the Lcbesgue measure and, therefore, there 
exist pdf's pt : M''" [0, +oo), t = 1, 2, . . ., such that 7rt(A) = ^^pt{x)dx for any A e B{W^'=). The density 
Pt is referred to as the filtering pdf at time t. In this section we briefly review the basic methodology for 
kernel density estimation and then describe the construction of sequences of approximations of pt using 
the particles generated by a particle filter and a generic kernel function. The section concludes with the 
discussion on the relationship between the complexity of the particle filter (i.e., the number of particles 
A^) and the choice of kernel bandwidth for the density estimators. 

3.1 Kernel density estimators 

In order to build an approximation of the function pt{x) using a size sample of size N , {x^^^}i=i^,,,^N, 
we resort to the classical kernel approach commonly used in density estimation |42l |44j |42 • Specifically, 
given a kernel function (p : M''^ — > M+, we build a regularized density function of the form 

1 ^ 

^'^(-) = ^E'^(---i"^ (12) 

n=l 

In the classical theory, the kernel function (f) is often taken to be a non-negative and symmetric probability 
density function with zero mean and finite second order moment. Specifically, the following assumptions 
are commonly made |42j and we abide by them in this papej^. 

A. 1 The kernel 4> is a pdf w.r.t. the Lebesgue measure. In particular, 4){x) > Vx' G and 
J 4>{x)dx = 1. 

^In classical kernel density estimation theory [42], the kernel function is often chosen to be symmetric, i.e., to satisfy 
J x<j){x)dx = 0. Such property is not needed for the results to be introduced here, though. 
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A. 2 The probability distribution with density (p has a finite second order moment, i.e., Ci = 
J \\x\\'^(j){x)dx < oo. 

Given a function (p satisfying AH] and it is possible to define a family of re-scaled kernels 

(l)i.{x) = h-'^^(l){h-^x), (13) 

where ft, > is often referred to as the bandwidth of the kernel function. Both the kernel and the bandwidth 
can be optimized to minimize the mean integrated square error (MISE) between the regularized density 
and the target densities [33]. Specifically, the MISE is defined as 



MISE = / E 



\ ri=l 



x~x^r^) 



dx, (14) 



where the expectation is taken over the random sample. Although the MISE given in Eq. (fH|) is 
intractable in general, accurate asymptotic approximations (as N — > oo) arc known 33]. Moreover, if we 
assume that x[^\ ...,a;j^'' is an i.i.d. random sample drawn exactly from pt{x) (beware that this is not 
the case in the particle filtering framework, though), then the MISE is minimized by the Epanechnikov 
kernel [42] 

'^{i~\\x\n ifikii<i 



M^) = n'"'^ ' " " " " " . , (15) 
1^ 0, otherwise 

where v^^ is the volume of the unit sphere in R"*^ . If, additionally, pt (x) is Gaussian with unit covariance 

matrix, then the scaling oi <j)E that yields the minimum MISE is given by the bandwidth |44] 

In our case, pt{x) is not known (it is known not to be Gaussian in general, though) and the random 
sample x')^\ is not drawn from pt{x), so the standard results of [3111331133] and others cannot be 

applied directly and a specific analysis is needed [35l [34] . 

In [35] ■ two different kernel approximations of pt were studied. Using the notation in the present 
paper, they can be written as 



Pt.prei^) « ^^gt{yt\x)<j)l.{x-x^^'^), (16) 

n=l 

for the pre-regularized particle filter, and 

N 

Pt,post{x) = XI w'^"Vi(a; - x^"^), 

n=l 

for the post-regularized particle filter. Note that p^p^ei^) is an unnormalized approximation oi pt{x) (the 
normalization constant cannot be computed in general) . For the post-regularized density estimator it can 
be shown that under certain regularity assumptions |35l Theorem 6.15] 



E 



Pt{x) - P^'^post{x)\ dx I Yi.,t 



a.s. 
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(where the expectation is taken w.r.t. pfpost) when iV — > oo and /i — > jointly. Specifically, the mean 
total variation decreases as 0{N^^ + h?). A similar result can be shown for p^pg^t [351 Theorem 6.9]. 

Remark 2 Although we use the same notation for the particles, Section [Ol the 

sampling/resampling schemes in the pre-regularized and post-regularized particle filters are different from 
the basic 'bootstrap ' filter \^{\ \ '33^ . The pre-regularized filter, in particular, involves the use of a rejection 
sampler. 

Remark 3 The convergence results in \35f for the post-regularized density estimator p^post hold true 
when the following assumptions on the state-space model are guaranteed. 

• The transition kernel Rt{xt-i, A) — J^^ g^* (x)Tt{dx\xt-i) is mixing \35l Definition 3.2]. 

• The likelihood satisfies that swp^fzyj-i.i \\u\\'^i ^ W^'^ is the Sobolev space of functions 
defined on M.'^'^ which, together with their derivatives up to order 2, are integrable with respect to the 
Lebesgue measure, and \\ ■ \\2.1 is the corresponding norm. 

• The measure Tt{dx\xt-i) is absolutely continuous w.r.t. the Lebesgue measure, with density 
Tt'-'{x) e W2^i andsup^^_^gR<i, 11^^"' lb,! < 00. 

Assuming that Tt ^ t for every t > 1 (hence, the Markov state process is homogeneous), the analysis 
in |34j targets the convergence in total variation distance of the continuous measure {x)dx, where the 
density estimator is defined as 

N 



„N 



(x) = ctY,9tiyt\xy^"-'{x) 



Pt 

?l=l 

with normalization constant C( = (^J2n=i 1 9tiyt\x)T^*-'^ {x)dx^ . This is similar to the pre-regularized 
approximation p^p^^, but using the Markov kernel of the model, r, for smoothing, instead of the generic 
kernel Although in most problems it is possible to draw from r^'-"^, it is often not possible to 

h 

evaluate it and, in such cases, the approximation p^ is not practical. Also note that p^ is not a 
kernel density estimator of pt in the classical form of Ec^. (|12l) . The sample of size N from which 
the approximation is constructed corresponds to the variable Xt-i, rather than Xt, and smoothing is 
achieved by way of a prediction step (using the Markov kernel r). It is not possible, in general, to write 
pf{x) oc X^^iLi 9t{yt\'^)4> (3; — 2;|"\) for some kernel function 0. Under regularity assumptions on gt and 
T, it is proved in [Ml Theorem 2] that 

|pf (x) -pt(a;)|dx > e| < Cicxp{-C2A^}, t > 1, (17) 

for any e > and some constants ci, C2 > 0. 

Remark 4 The regularity assumptions on the state-space model in \34\ Theorem 2] are the following. 
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(a) There are pdf's {bt}t>i and two constants < Cr < Cr < oo such that 

Crbt{x) < r^'-i(x) < Crbt{x), for all x,t. 

(h) The likelihood gt satisfies t/iat supj->j.^ ^,gjjd^.j,gRtij, g^lyjx') < 

The assumption in (a) excludes, e.g., models of the form Xt = h(Xt-i) + Vt where the function 
h : R''== — > is not hounded or the noise process Vt is Gaussian \34\ Section 4-^]- The assumption 
in (h) is also stronger than required for Proposition]^ to hold true. 

3.2 Approximation of the filtering density and its derivatives 

We investigate particle-kernel approximations of pt constructed from a kernel function 4> and the samples 
x\ , n = 1,...,N, generated by the particle filter. Instead of restricting our attention to procedures 
based on a single kernel, however, we consider a sequence of functions 4)k ■ K''^ — > k E N, defined 
according to the notation in Eq. (fO| . i.e., (j>k{x) = ¥^^(t){kx). If (/) complies with A[l]and A[21 then we 
have similar properties for 0^. Trivially, (t>k{x) > Vx e and it is also straightforward to check that 
J (f>k{x)dx — 1. Moreover, if we apply the change of variable y — kx and note that dy = k'^'^dx, then 

|2 , / N , 1 /■ ,, 1,2 ,/ X , C2 

J ' 

from AH 

PP 

denoted as pj' and has the form 



(j)k{x)dx ^ — / ||y|| (j){y)dy ~ 



(n) 

The approximation of pt generated by the particles x\ \ n — 1,...,A^, and the k-th kernel, (j)k^ is 



n=l 



where = ipkix ~ x'). Beware that, in our notation, we skip the dependence of on the number 

of particles, N , for the sake of simplicity. In Section 14.11 we will assume a certain relationship between 
N and k that will be carried on through the rest of the paper and justifies the omission in the notation. 
Let us also remark that we do not construct p^ in order to approximate integrals w.r.t. the filtering 
measure (this is more efficiently achieved using Eq. (IIOD ). Instead, we aim at applications where an 
explicit approximation of the density pt is necessary. Some examples are considered in Section [S] 

In order to investigate the approximation of derivatives of pt, let us consider the multi- index 
a — («!, a2, . . . , Q!d^) e N* X N* X . . . X N*, where N* = N U {0}, and introduce the partial derivative 
operator I?" defined as 

/^ai . . . /9"<ix h 



dxr---dxZ^ 

for any (sufficiently differentiable) function h : M.^" — > M. The order of the derivative D^h is denoted as 
\a\ — '^■i- ^^'•^ interested in the approximation of functions D"pt{x) which are continuous, as 

explicitly given below. 
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A. 3 For every x in the domain of pt{x), D°'pt{x) exists and is Lipschitz continuous, i.e., there exists a 
constant Ca.t > such that \D'^pt{x ~ z) — D°'pt{x)\ < Ca^t\\z\\ for all x,z G W^" . 

For the same a, we also impose the following condition on the kernel (j). 

A. 4 D^cj) € CbiM.'^"), i.e., D^cj) is a continuous and bounded function. In particular, |l-D"</)|loo = 
sup^gRd, \D"(j){x)\ < OO. 

Remark 5 Trivially, ifD^cjie Ch(R'*-) then D°'(f>k G Ct(M''-) for any finite k. In particular, 

|li?"0fei|oo=fc''^+l"l||^?"0||oo. 

The approximation of D"pt computed from the samples x\^\ n = 1,...,N, and the fc-th kernel, (pk, 
has the form 

1 ^ 

Dyn^) = ^E^"<^^(^i"^) = (^"'^fc-'^f )• (18) 

n=l 

3.3 Complexity of the particle filter and choice of kernel bandwidth 

In the sequel, we will be concerned with the convergence of the sequence of approximations {D°'pf}k>i 
under the generic assumptions A1T]-A|31 All the convergence results introduced in Sections S] and [S] are 
given either as limits, for k — > oo, or as error bounds that decrease with k. 

Recall, however, that Pf{x) = ((/)^,7r^), i.e., the density estimator p'l depends both on the kernel 
bandwidth h = ^ and the number of particles N. A distinctive feature of the analysis in Sections |4] and 
[S]is that it links both indices by way of the inequality N > k^idi^+lal+i] ^ where \a\ — X]f=i '^i ^^ic order 
of the derivative D". For a = (0, . . . , 0), D^p^ ~ and 

iV> fc2(rf.+i). (19) 

Obviously, k oo implies that N oo and h ^ 0. 

This connection is useful to provide simple bounds for the approximation errors, but also because it 
yields guidance for the numerical implementation of the density estimators. In particular, for \a\ ~ and 
a fixed kernel bandwidth h = j:, the inequality in (jl9p determines the minimum number of particles N 
that are needed in the particle filter in order to guarantee that convergence, at the rates given by the 
Theorems of Sections |4] and [5l holds. A lesser number of samples (i.e., some N < fc2(d^+i)-j -^^ould result 
in an under-smoothed density Pf{x) with a bigger approximation error. 

If the computational complexity of the particle filter is limited by practical considerations, then N 

1 

is fixed and the error bounds to be introduced only hold when k < iV^<''^+i' or, cquivalently, when the 

, 1 

kernel bandwidth is lower-bounded as h = j: > N 2(da;+i) ^ A smaller bandwidth would, again, result 

in an under-smoothed approximation Pt{x). On the other hand, since over-smoothing also increases the 

approximation error of kernel density estimators |42j . it is convenient to choose the smallest possible 

1 

bandwidth h. For fixed A^, we should therefore select h = N ^cd^^+i) _ 
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4 Convergence of the approximations 

Starting from Proposition [U we prove that the kernel approximations of the filtering pdf, Pt{x), and its 
derivates converge a.s. for every x in the domain of pt, both point-wise and uniformly on R''^. We also 
prove that the smoothed approximating measure Tt^^'^\dx) = p^{x)dx converges to tt^ in total variation 
distance and that the integrated square error of a sequence of truncated density estimators converges 
quadratically (in k) toward a.s. Explicit convergence rates for the approximations are given. 

4.1 Almost sure convergence 

In this section we obtain convergence rates for the particle- kernel approximation D°'p^{x) of Eq. ^TE\\ . 
Depending on the support of the density pt{x), these rates may be point-wise or uniform (for all x). In 
both cases, convergence is attained a.s. based on the following auxiliary result. 

Lemma 1 Let {^'" j^.^j^ he a sequence of non-negative random variables such that, for p > 2, 

E[{e'^r]<j^, (20) 

where c > and < < 1 are constant w.r.t. k. Then, there exists a non-negative and a.s. finite random 
variable , independent ofk, .such that 

< (21) 

where ■^-^ < £ < 1 is also a constant w.r.t. k. 
Proof: See Appendix [Bl 

Remark 6 In Lemma{^ if the inequality (j20p holds for all p > 2 then the constant e in (|2ip can be made 
arbitrarily small, i.e., we can choose < e < 1. 

Using Lemma [1] it is possible to prove that D°'p^{x) — > D°'pt{x) a.s. and obtain explicit convergence 
rates. In order to establish a connection between the sequence of kernels 4>k{x), fc g N, and the sequence of 
measure approximations Trf , iV G N, we define the number of particles to be a function of the kernel index 
and denote it as N{k). To be specific, for a given multi-index a, we assume that N{k) > fc2(d.+l«l+i). In 
this way, all the convergence rates to be presented in this paper are primarily given in terms of the kernel 
index fc. Wc first show that D"p^ D"pt point-wise for x G 

Theorem 1 Under assumptions A\^and Nik) > fc2(d,+|Q|+i)^ inequality 



\Dy1{x)-Dyt{x)\<'^^ (22) 



holds true, with V^'"'^ an a.s. finite, non-negative random variable and a constant < e < 1. In 
particular, 

lim \D°'p'^{x)- D°'pt{x)\^0 a.s. (23) 
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Proof: Let us construct an approximation oi pt{x) using the kernel 0^ and the true filtering measure tt^, 
namely, Pf{x) = ttj). Since TTt{dx) ~ pt{x)dx, the approximation is actually a convolution integral 
and can be written in two alternative ways using the commutative property, namely 

Ptix) = J 4>kix - z)pt{z)dz = J (j)k{z)pt{x - z)dz. (24) 

Let us now consider the derivative D^pt . If we apply the operator to p^ in ((M)) we readily obtain 

D'^p'^ix) = Uk{z)D^pt{x-z)dz 



and, using the latter expression, we can find an upper bound for the error \D°'pt^ {x) — D"pt(x)|. In 
particular, 



D'-pt^x) - D'-ptix) 



(j)k{z)D"ptix - z)dz - D"ptix) 
cj)kiz)\D''pt{x ~ z) ~ D"pt{x)\dz 

< Ca,t / (l)kiz)\\z\\dz 



< 



< Co 



(f)k{z)\\zPdz 



Ca.t 



y/C2 



(25) 

(26) 
(27) 
(28) 



where Eq. follows from AH] (namely, cj) > 0), (pS)) is obtained from the Lipschitz assumption A131 
(P7)) follows from Jensen's inequality and, finally, the bound in ([25]) is obtained from assumption 
Note that Ca^t and C2 are constants with respect to both x and k. As a consequence of 



lim D"p';{x)=D"pt{x). 

Consider now the approximation, with N{k) particles, D°'p'l = (D"(/>^, Trf^'^'^-') of the integral 
{D°'4r^, TTt). From Proposition [T] and assumption A|3]we obtain 



D'^p1{x)-D'^p'1{x)\\ = {D^rk,^^^"'') - {D''<t>l.^t) 



< 



where we have used Remark [5] and the constant cj is independent of N(k) and x. 
A straightforward application of the triangle inequality now yields 



(29) 



iD^'p'^ix) ~ D"pt{x)\\^ < \\Dyl{x) - (x)!!^ + WD^'plix) - Oytix) 



(30) 



Ip — II ^ i V ' - <■ ' Up II ^ i ^ ' ' " ^ ' lip 

The first term on the right-hand side of ([50]) can be bounded using l^^ . while the second term also has 
an upper bound given bjlfl (j28p . Taking both bounds together, we arrive at 



\D'^p^,{x)~Dyt{x)\\^< 



Ct 



./Nik) 



k - k ' 



(31) 



'Note that ||D"pf'(a;) - D°'pt{x)\\^ = |D"pJ'(x) - D^ptix)] because D^pt (x) does not depend on vr 



N(k) 
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where the second inequahty fohows from the assumption N{k) > fc2(da;+|a|+i) ^^^^ „ 
Ca,t^/C2,a IS a Constant. 

The inequahty (j3ip immediately yields 



D'^p^ix)-Dy,ix)\'' <^ (32) 



E 

and we can apply LemmafU with 6''"' = \D°'p^{x) — D°'pt{x)\, 1^ = and arbitrarily large p > 2, to obtain 

\DyUx)-D'^Pt{x)\<y^, (33) 



where V"''"'^ is a non-negative and a.s. finite random variable and < e < f is a constant, both of them 
independent of k. The limit in Eq. psp follows immediately from the inequality ([33]). 

□ 

Remark 7 The constant Ca,t of Eq. (|3ip is independent of the index k and the point x E R''='' . The random 
variable V"'^'"^ is also independent of the kernel index k, as explicitly given by Lemma{J\ However, it may 
depend on the multi-index a and the point x where the derivative of the density is approximated, hence 
the notation. 

Remark 8 For a ~ (0, . . . , 0) = 0, the inequality implies that that we can construct a particle 

approximation of pt{x) that converges point-wise. In particular, D^pt{x) = pt{x) and D^p^{x) ^ p^{x) = 
((/)^, Trf^'''"^), hence Eq. 1^) becomes 

lim |pf(x) -pt(x)| = a.s. (34) 

for every x £ R"*^ . 

Remark 9 The proof of Theorem [7] does not demand that the assumptions ^[7] and N{k) > 

p(d,+|a|+i) hold for every possible a, but only for the particular derivative we need to approximate. For 
instance, if we only aim to approximate pt{x) (i.e., a ~ 0), assumption A\^implies that the distribution 
with density (j) must have a finite second order moment, assumption means that pt must be Lipschitz, 
assumption implies that the basic kernel function (j) must be continuous and bounded, and it suffices 
that the number of particles satisfies the inequality N(k) > fc2(rfa;+i)^ 

Most of the results to be given in the remaining of this paper are conditional on the assumptions A [TJ 
A\M A\^andN{k) > pid.+\o.\+i) ^ the same as Theorem]^ However, they refer only to properties of 
Pt and its first order derivatives and, as a consequence, it is enough to assume that A\^and A^hold true 
for a = and a = 1 = (1, . . . , 1) alone. For the same reason, it suffices to assume N{k) > fc2(2d^+i)^ 

Through the rest of the paper, we say that the "standard conditions'' are satisfied when 

• Am and A\^hold true; 
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• and hold true for, at least, a = and a — \; and 

• N{k) > fc2(2d. + l). 

If we restrict x to take values on a sequence of compact subsets of R''^ , then we can obtain a convergence 
rate for the error \pt{x) ~ pt{x)\ that is uniform on x, instead of point- wise hke in Theorem [T] For the 
foUowing resuh we fix p > 2 and consider the sequence of hypercubes 

/Cfc = [-il/fc, +Mk] X ... X [-A4, +Mk] C M''% 

where Alk = Mk'^^p , and M > and < /3 < 1 are positive constants independent of k. Note that, for 
any fixed p and /3 > 0, hmfc_j.oo ICk = K''^ . 

Theorem 2 If the standard conditions are satisfied, then 

sup \pt{x) -Ptix)\ < 

xeKk 

where > is an a.s. finite random variable and < e < 1 is a constant, both of them independent of 
k and x. In particular, 

fim sup \pf {x) — pt{x)\ = a.s. 

fc-i-oo xeKk 

Proof: For any x = {xi, . . . ,Xd^) G /Cfc and a function / : R'*^ — ^ M continuous, bounded and 
differentiable, 

■■■ D^f{z)dz- / ••• / D^f{z)dz. 

^Mk J-Mk J-Mk J-Mk 

In particular, for Xi £ [— Mfe, Mfe], i = 1, dx, and the assumption AQlwith a = 1, 

\pUx) - Pt{x) \ < 2 I ■ I \D^p1{z) - D^pt{z)\dz + \p^t{0) - pt{0)\ (35) 
and, as a consequence, 



sup \p^ (x) - pt {x) I < 2A^ + \p^^ (0) - (0) I , (36) 

x^Kh 



where 

= / ••• / |i?^Pt"(^)-7^V(^)|rf2. 

An application of Jensen's inequality yields, for p > 1, 

^M<::TTT;r/ ■••/ \D^p';{si{z)) - D^pi{s,{z))\Uz, 



hence 



Since 



£=0 





l-Mk 


lo 


Jo 



(^fc)P<2^^(p-i)Af,^-(P-i) ^ / .../ iD^pliseiz)) - D^ptiseiz))]" dz. (37) 



|l?M-(^K^))-^'pt(^£W)r] <^ (38) 
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from inequality (j3ip in the proof of Theorem [U we can combine ([55)) and p7[) to arrive at 



where the equahty follows from the relationship = ^k-^^p. Using Lemma [T] with 9^ = , p > 2, 



2' 

V = P and c = J, we obtain a constant £i e {~^^ ^'^^ ^ non-negative and a.s. finite random variable 
YA,ei ^ i3Q^}j Qf them independent of fc, such that 



Since, from Proposition [U 



A'' < — . (39) 



E 



\p^t{x)-pt{x)?' 



kp ' 

we can apply Lemma [T] again, with 9k = (0) — Pt(0)|, p > 2, v and c = c^^lo obtain that 

\pt{Q)-pm\<^^. (40) 

where £2 € (p'"*^) ^ constant and jg non-negative and a.s. finite random variable, both of 

them independent of k. 

If we choose e = ei = £2 e \ ^-^, 1 ) and define = V^'^^ + ]/Pt(o),£2^ then the combination of Eqs. 



Ml), dSH) and go]) yields 



sup 1^4*^(2;) -pt(x)| < — 



where U' is a.s. finite. Note that and e are independent of k. Moreover, we can choose p as large as 
we wish and /3 > as small as needed, hence we can select e e (0, 1). Therefore, the proof is complete. 

□ 

Remark 10 Assuming that A\^ and A^ hold for the multi-index a' = a + 1, the argument of the proof 
of Theorem\^ can also be adapted to show that 

sup \Dy1{x)-D"pt{x)\<^, 

where the constant < e < 1 and the a.s. finite random variable > are independent of k. 

Remark 11 Theorem\^ also holds for a fixed compact subset JC C M''=" instead of the sequence ICi, IC2, . ■ ■ 
In particular, the presented proof is still valid for a fixed hypercube JC ~ [—A/, +M] x . . . [—M, +M] (i.e., 
with (3 = and fixed M > in the definition of Mk). Therefore, 

sup \p^^ {x) - pt [x) I < —J , (41) 

where the constant < e < 1 and the a.s. finite random variable > are independent of k. 
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4.2 Convergence in total variation distance 

The total variation distance (TVD) between two measures Hi, IJ,2 € 'P(K'*) on the Borel cr-algebra B{M.'^) 
is defined as 

drvip-iilJ-^) = sup \^ii{A) - ^i2{A)\. 

Correspondingly, a sequence of measures /i" € ViW^) converges toward /i G 'P{W'') in TVD when 
lim„_yoo dTvip-^ifJ-) ~ 0. It can be shown that, if /x" and fi have densities w.r.t. the Lebesgue measure, 
denoted and q, respectively, then 

dTy(A^",A^) = i / \q''{x)^q{x)\dx 



2 _ 

and, therefore, the sequence ^" converges to fi in TVD if, and only if. 



lim / |g"(x)-g(x)|dx = 0. (42) 

n— ^oo y 

Consider the smooth approximating measures 

TT^''''\dx) =p'l{x)dx, fc = l,2,... 

In this section we show that the sequence tt^^*'^ converges toward irt in TVD, as A: — c», by proving first 
that J |p( — pt \ dx ^ under the same assumptions of Thcorem[2] This result is established by Theorem 
[3] below. The same as in the proof of Theorem [51 we consider an increasing sequence of hypercubes 
/Ci C . . . C /Cfc C . . . C R'*-, where ICk = [-A4,+Affe] x ... x [~Mk,+Mk] and Mk = , with 

constants < /? < 1 and p > 3. Also, recall that, for a set A G M'', = M''\A denotes its complement 
and, given a probability measure iJ. G 7'(M'^), n{A) = ij.{dx) is the probability of A. 

Theorem 3 If the standard conditions are satisfied and TTt{IC\) < where b > and 7 > are 

arbitrary but constant w.r.t. k, then 



|p?(x) — Pt{x) \ dx < - — ^-r, T, 

l-f t \ / -dV /\ l,min{l— £,7} ' 



where > Q is an a.s. finite random variable and Q < e < 1 is a constant, both of them independent of 
k. In particular, 



lim / \p^{x) — pt{x)\dx — Q a.s. 



and, as a consequence. 



lim drviT^t i^t) = a.s. 



.Nik) 

Proof: We start with a trivial decomposition of the integrated absolute error, 
Ipf (.) - . / |rf(.) - + / , |rf(.) - 

\pt (x) - pt{x)\ dx + 2 / pt{x)dx+ / {pt{x) - pt{x)) dx, 



< 



K-k 
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where the equahty follows from ICk U /Cj, = M''^ and the inequality is obtained from the fact that pt and 
p^ are no n- negative. Moreover 



{P^tix) -Pt{x)) dx 



{Pt{x) - Pt{x)) dx 



< I \Ptix)-Ptix)\dx 



hence 



p'l{x) - pt{x)\dx <2 i \p'l{x) - pt{x)\dx + 2 i pt{x)dx 



The first term in (j43p can be bounded by 



\pt{x) ~ pt{x)\dx < C{K,k) sup \p^{x) - pt{x)\ 



(43) 



(44) 



where CQCk) = (2Mfc)''- = fcp is the Lebes gue measure of /C^. From Theorem [5J the supremum in (|44p 
can be bounded as sup^g^i^^ \p't {x) ~ Pt{x)\ < V^^ /k^~^^ , where V^^ > is an a.s. finite random variable 
and < £i < 1 is a constant, both independent of k. Therefore, the inequality (pi)) can be extended 
to yield 



\p'i{x) -pt{x) \ dx < 



ICk 



where e = ei + ^ and V = V^^ . If we choose £i < 1 - |, then s G l) • Note that, for l3 < 1 and 



p>3, 1 



I _ 1+R 
p p 



> 1 — ^ > 0, hence both ei and e are well defined. 



For the second integral in Eq. (j331)> note that J^t Pt{x)dx — 7rt(/Cj,) and, therefore, it can be bounded 
directly from the assumptions in the Theorem, i.e.. 



2 / pt{x)dx < bk-'^, 

-t 



(46) 



where & > and 7 > are constant w.r.t. k. Putting together Eqs. (|43l) . (|45|) and (|46)) gives us the 
result. 

□ 



4.3 Integrated square error 

A standard figure of merit for the assessment of kernel density estimators is the mean integrated square 
error (MISE) US] ■ If we assume that both pt (x) and the kernel (j){x) take values on a compact set 
/C, then it is relatively simple to prove that the MISE of the sequence of approximations D"p^ converges 
toward quadratically with the index k. In particular, we have the following result. 

Theorem 4 Assume that A\M -4(1 A\^ and N{k) > fc2('^-+l"l+i) hold true. If both pt{x) and the 
kernel 4>{x) take values on a compact set K, C , then 



MISE= / E 

JK 



D'^pUx) - Dy^ix))' 



dx < 



CaX,t 



where Cax,t > is constant w.r.t. k. 
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Proof: Since any compact set is contained in a larger hypercube, we can choose K, = [—M, +M] x • • • x 
[—A/, without loss of generality. Furthermore, since the assumptions of Theorem [1] are satisfied, we 
can recall the inequality in ([5^ . which, selecting p — 2, yields 



E 

where the constant ^ is independent of k and x. Therefore 

J^E[{Dy',{x) - Dyt{x)Y] dx < ^C{JC) < 



fc2 ' 



where C{JC) = {2MY- is the Lebes gue measure of JC and Ca,K.t — {2M)'^^ j. 

□ 

It is also possible to establish a quadratic convergence rate (w.r.t. k) for the integrated square error 

(ISE) of a sequence of truncated density approximations. In particular, consider the usual hypercubes 

1 ^ ^ 

Kk — [— A/fc, +A/fc] X • • • X [—Mk, +Affe] with Mk = ^k''=^p , for some p > § and a constant < /3 < 1, and 
define the truncated density estimators 



P:'\x)=I^Jx)p'^{x) 



Pt{x), ifxe/Cfc, 
0, otherwise 



Since limfc_j.oo A^fc ~ , it follows that limfc_^oo \Pt'^{^) ~ P^{^)\ — ^-nd we can make pl'^ arbitrarily 
close to the original approximation. The theorem below states that pl'^ converges a.s. toward pt, with a 
quadratic rate. 

Theorem 5 If the standard conditions are satisfied, pt € B{W^'') and Trt{JCk) < hk~^ , where & > and 
7 > are arbitrary hut constant w.r.t. k, then 



ISE^ J [pj'''{x)-pt{x)y < 



where > Q is an a.s. finite random variable, independent of k, and < e < 2 is an arbitrarily small 
constant. In particular, 



Proof: We start with the trivial decomposition 



lim / (^pj'^'lx) — pt{x)j dx — a.s. 



2 /* 2 /* 2 

{Pt'''ix)-Ptix)j dx ^ (pj'''{x) - pt{x)j dx+ l(^pj'''{x)-pt{x)j dx, (47) 



where ICj, = M.'^'^\JCk is the complement of ICk, and, expanding the square in the last integral of Eq. (|T7)) . 
we obtain 



Pt^'^i^) ~Pt{x)] dx = (pj'''{x) - pt{x)\ dx + j {pt{x) - pj {x)\ pt{x)dx 

+ / [pj^''{x)~pt{x))pj^\x)dx. (48) 
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In the rest of the proof, we compute upper bounds for each of the integrals on the right-hand side of Eq. 

dS). 

For the first term in (pS)) we note that pj'''{x) — p^{x) for aU x G K,k, hence 

/ (pl^\x)-Mx)f dx^ I {p'i{x)-pt{x)f dx<C{JCk){snY> pl'\x)~pt{x)\ , (49) 

where C{ICk) = (2Affc)'*- = kp. Using Theorem d we obtain an upper bound for the supremum in Eq. 
(|49l). namely sup^^^^ IpH^) ~ Ptix)\ < V^^ /k'^-^\ where > is an a.s. finite random variable and 

1+/3 
P 

(PI) as 

" {pj'\x)^p,{x)ydx<kl^^ = -^, (50) 

where e = 2ei + | and U" = (V^^f. If we choose ei < 1 - ^, then s € (^^^,2j . Note that, for /3 < 1 
and p > |, 2 - 2±M > hence e is weU defined. 

For the second term in the right-hand side of Eq. (j48p we simply note that pl'^{x) = for all x G 1C\. 
andpt(a;) < ||pt||oo < oo, since pt G BlW^"). Therefore, 



p < £i < 1 is a constant. Both V^'^ and ei are independent of k. We then extend the inequality in 
I as 

[Pt y-^ ) ~ Ptyj^ ) ] "^'r^ft'" ^ 



{pt{x) - pj '^'{x)^ Pt{x)dx < \\pt\\oo / Pt{x)dx =\\pt\\ooT^t{K-X), 

and using the assumption 7rt(/Cj^,) < bk~'' we obtain 

(p,(x)-p7''^(x))p,(x)dx< (51) 



The third term is trivial. Since pj'''{x) = for all x G K.^, it follows that 

^ (p7''^(x)-p,(a;))p7'^(x)=0. (52) 
Substituting Eqs. ^ and ^ into Eq. (051) yields 

where C/" = L/" + fe|bt||oo and < e < 2. 
□ 

Remark 12 A classical asymptotic approximation of the MISE (AMISE) for kernel density estimators 
built from i.i.d. samples is (see, e.g., \22^ and note that we restrict ourselves to diagonal bandwidth 
matrices) 

AMISE^hM^,Po) + ^, (53) 

where h > is the bandwidth parameter, c{(f>,Po) > is a constant that depends on the kernel (j) and the 
target density (denoted po here and assumed twice differentiable) , c{(j)) > is another constant depending 
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on (p alone and N is the number of samples. If we substitute /i = 1/fc and N = fc^''^+^ , as given by our 
analysis, into the expression above, then we find that the MISE converges asymptotically as , for 

some constant c{(j),Po) > 0. We note, however, that 

• Eq. \5S\) is only an asymptotic approximation of the MISE, whereas Theorems j^] and [3| give actual 
upper bounds for the MISE and the ISE that are valid for every k; 

• the AMISE of Eq. h5S^) is derived under the assumption that a size N sample drawn from the density 
Pt is available \44\ , whereas Theorems [^J and hold true for the smoothing of any random measure 
TT^ that satisfies ||(/, tt^) — (/, 7rt)||p < for some constant c and f G B(R'^^). 

4.4 A simple example 

There are several possible choices for the kernel function (j>{x) that comply with assumptions Alljand Al2j 
In particular, the standard multivariate Gaussian density with unit covariance, 



(27r 



the (ia;-dimensional Laplacian pdf, 

where b ~ ^ wk-, and the Epanechnikov kernel (j)E{x) of Eq. are densities with bounded second order 




2d^ 

moment. 

It is also straightforward to check assumption A|4]for a = and a = 1. In particular, for a = 0, it 
is apparent that (j)G, 4'Lt4>e G C&(M''"=). For a = 1, the partial derivatives of the Gaussian and Laplacian 



kernels yield 



DVg(x) = T-^II-^iCxpJ -l^a^n ^^^^ 



2 




D^(j)L{x) = ^d:^exp<( -- y^lxjl ^ , XT^O, 

respectively, where ri,+ = |{/ e {1, . . . , dx} ■ xi > 0}| is the number of positive elements of x G M''"'. It is 
not hard to verify that D^(I)q e C6(R'^==), while D^4>l G BiM.'^"). As for the Epanechnikov kernel, it is 
easy to show that D^(I)e{x) = V.t e . 

In the sequel, we consider a simple example consisting in the approximation of a Gaussian filtering 
density using the Epanechnikov kernel. 
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Example 1 Consider the state-space system 

poixo) = NixQ;0,l2), Xt = AXt-i + Ut, Yt = BXt + Vt, t = l,2,... (54) 

where N{xq; 0,l2) *s the bivariate Gaussian pdf with mean and 2x2 identity covariance matrix, T2; the 
matrices A,B <E M^^^ are 

0.50 0.30 " 
-0.80 0.20 J ' 

and Ut, Vt, t — 1,2, are sequences of independent and identically distributed 2x1 Gaussian vectors with 
zero mean and covariance Z2. For this class of (linear and Gaussian) models the filtering pdf pt, t > I, 
can be computed exactly using the Kalman filter fg^J / and, therefore, we have a reference for comparison 
with the approximations p^ produced by the particle filter with N{k) = fc2(tia,+i) _ ^6 gamples. 

For the simulation, we generated two sequences, xq, xi, . . . , xt and yi, . . . ,yT for T = 50, according 
to the model (j54p . Then, using the fixed data yi-.T, we run a Kalman filter to compute the Gaussian 
pdf pt{x) = N{x;xt,'^t) exactly, where xt and St cire the posterior mean and covariance at time T, 
respectively. For the same sequence yi.T, we run independent particle filters with various values of k and 
N(k) — fc® particles each. 

Figure[J\ shows plots of the approximations p^ix) for k = 4, 7, 10 (constructed using the Epanechnikov 
kernel, 4>e) and the true pdf pt{x). The plots are drawn from a regular grid of points in M.^, namely 

X e Gt = {ixi,X2) ■■ xi = -2.916 + 0.2n, = -3.536 + 0.2n, n = 1, 2, . . . , 42} (55) 

(the offsets —2.916 and —3.536 correspond to the true posterior mean of Xt). We can see that there is 
an obvious error for small k, while for k ~ 10 the difference between Pt{x) and its approximation is 
negligible. 

5 Applications 

We illustrate the use of the convergence results in Section [3] by addressing two application problems: 
the computation of maximum a posteriori (MAP) estimators and the approximation of functionals of 
the filtering density, pt. All through this section, wc implicitly assume that the standard conditions of 
Remark [5] are satisfied. 

5.1 MAP estimation 

We tackle the problem of approximating the maximum a posteriori (MAP) estimator of the r.v. Xt. In 
particular, we address the numerical search of elements of the set 

St = arg max pt{x), (56) 



0.50 
0.39 



-0.35 
-0.45 



B = 
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(a) fc = 4 and N{k) = 4, 096 



(b) fc = 7 and N{k) = 117, 649 




(c) fc = 10 and N{k) = 1, 000, 000 (d) pt{x) (exact) 

Figure 1: Plots (a)-(c) display the approximations of the filtering density produced by the particle 
filter, p!^{x), with an increasing number of particles N{k) — k^, and an Epanechnikov kernel, (j)E- 
The true pdf, pt{x), is shown in Plot (d) for comparison. The plots correspond to the discrete grid 
Gt in Eq. ([55]). 
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where s € St if, and only if, pt{s) = msx^^g^d^ pt{x) . Note that this is a relevant problem since MAP 
estimates arc often used, e.g., in signal processing and engineering applications (see, e.g., [Ml ISZl [53] ) , 
and the density pt (x) cannot be analytically found in general. 
Let 

= arg max (.t) 

be the set of MAP estimates for the approximation density p^ (x) and note that Xk S if, and only if, 
Pf{xk) = maXj.gj{d^ Pti^)- We can build a sequence of approximate estimates, denoted {xk}k>i, by taking 
one element from each set 5'^'^, fc = 1, 2, at time t. li St is nonempty, then any convergent subsequence 
of {xk}k>i yields an arbitrarily accurate approximation of a true MAP estimator s S S't, as stated below. 



Theorem 6 Assume that St ^ ^ and take any convergent subsequence o/ {ife}fc>i, denoted {xki}i>i- 
Let X ~ Ymii^ooXki he the limit of such subsequence. If pt G C'b(M'^"=), then pt{x) = maji^^-g^d^ pt{x) . In 
particular, if pt{x) has a unique maximum, then St is a singleton and limi^oo Xk^ ~ argmaXj.gjjda; pt{x). 

Proof: We prove the theorem by contradiction. Specifically, assume that pt{x) < max^-g^dj; pt(x). Then, 
choose some s G S't, so that pt{s) = max^ggd^ Pt{x) and pt{x) < pt{s) and let 

.^EM^EM^O, (57) 

Now, choose a compact subset K, C M.'^" that contain s, {xki}i>i and x. From Remark [TTl 
oo ^^Paj^A;;; Pt('^)| — lience there exists tti sucli tliat for all k ^ ui 

sup |pj^(x) -Pt{x)\ < e. (58) 

Moreover, since pt{x) is continuous at every point a; G /C, we can choose an integer zq such that for all 
I > zo we obtain 

\pt{xki) - Pt{x)\ < e. (59) 
Now. choose an index £ such that £ > io and £ > m. Then, for every i, ki > £, we have 



' ^ , ^ 

v\' (ifei )-v\"k^) = v\' (xki ) - Pt {xk^ ) + Pt (xki )-pt{x) 



= -3 

k 



+ Pt{x)-pt{s)+pt{s)~p1'{s) <0, (60) 

where the first term on the right-hand side, p^^ {xki) — Pt{xki) < e, follows from inequality (|58D . the second 
term, pt{xki) — pt{x) < e, follows from inequality ([5^ . the third term, pt{x) — pt{s) ~ — 3e, is due to 
the definition in (j57p and for the fourth term, pt{s) — pj^'(s) < e, is obtained from the inequality (j58p . 
Therefore, Xki ^ argmax^gjjd^ Pt^ix) and we arrive at a contradiction. Hence, pt{x) ~ max^^gRd^ Pt{x). 
□ 
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Remark 13 Note that the whole sequence {xk\ may not converge to a MAP estimate since it may, e.g., 
alternate between different elements of St . 

Many global optimization algorithms, such as simulated annealing [S1I2H] or accelerated random search 
[I], rely only on the evaluation of the objective function and Theorem |6] justifies their use with the 
approximation Many other optimization procedures are based on the evaluation of derivatcs of the 

objective function. For example, we may want to use a gradient search method to find a local maximum 
of pt{x), i.e., to find a solution of the equation 



VxPt{x) = 0, 



(61) 



where x = (.ti , . . . , Xd^ ) and 





dpt 
dx\ 








'^xPt{x) = 


dpt 
dxd^ 









i-th 

with = (0, . . . , 1 , . . . , 0). Let x* be a solution of (IFTI) . i.e., VxPtix*) — 0. Under the assumptions of 
Theorem m for every e > there exists k^ such that, \/k > k^, 

-e < D^'p^ix*) < e a.s.. 



Therefore, 



\^xpHx*)\\ = 



\ 1=1 



and, since e can be chosen as small as we wish. 



lim ||V,pf(a;*)|l =0 a.s., 

k—^oo 

which justifies the application of a gradient search procedure using the approximation of the filtering pdf. 

Example 2 We illustrate the application of a gradient search procedure using the same example as in 
Section \4-4\ particular, we consider the approximation of the maximum of the Gaussian filtering pdf 
Pt{x), T = 50, using a steepest descent method. Given an approximation p^{x) of the filtering density 
constructed with the Gaussian kernel (pQ, we run the iterative algorithm 



XT{i + 1)*= = iT(j)^ + a ^xPt{x)\,^^^^^^). , i = 0, 1, 2, . 



(62) 



with initial condition xt{0)^ = (~2, — 2)^ and step-size parameter a = 0.1. This procedure yields a 
sequence of approximations ^^(l)'^, . . . , xrii)'', ■ . ■ of the MAP estimator xt. Since for the model of Eq. 



27 



i =1 Ci 1 ± 

X2 



^ — ^ 1 ± 

X2 



=4 ^ 6 1 ± 

X2 



(a) With fc = 5 and N{k) = 15, 625. (b) With fc = 9 and N{k) = 531, 441. (c) With the true posterior pdf, pt. 

Figure 2: Trajectories of the gradient search algorithms. Plot (a) shows the estimates produced by 
the gradient search algorithm of Eq. (j62p superimposed over a contour representation of p!^^{x). 
Plot (b) displays the estimates and contour graph for p!^^{x). Plot (c) shows the estimates produced 
by the gradient search algorithm of Eq. (j63p superimposed over a contour representation of pt{x), 
for comparison. 

(|54p it is possible to obtain Pt{x) exactly, we have also run a steepest descent search over the true filtering 
pdf, namely, 

iT(« + l) = i-T(j) + aV,pT(x)|,^j^(,), i = 0,l,2,..., (63) 

that generates the estimates ^^(l), . . . , xrii), ■ ■ ■ for the same initial condition and step size. 

The results, using the same sequence of observations as in Section \4-4\ ^.^^ shown in Figure [H 
Specifically, Figures\^a) and\^b) show the trajectories described by the estimates xt{^)^ , ■ ■ ■ jXri'i)'', ■ ■ ■ 
superimposed over the contour plots of the approximate pdf pi^{x) for fc = 5 and k = 9, respectively (and 
N{k) — k^). For comparison, Figure\^c) depicts the sequence ^^(l), . . . , a;T(i)' • ■ • obtained from the 
search over the true density pt{x), together with the corresponding contour plot. We observe that both the 
pdf's and the trajectories described by the search algorithms are very similar. 

Figure\^ depicts the sequences of estimates ^^(l), . . . , i;T(800) and xt{'^), ■ • ■ , a;T(800) computed by the 
search algorithms with the approximate and true gradients, Vp^(x) and'Vpxix), respectively, component 
wise. We observe that, for fc = 9, the approximation is tight in both dimensions. 

Similar arguments can be given for other optimization methods based on the derivatives of the density 
Pt. A relevant example is the Newton method (see, e.g., [3]) as it serves to illustrate the approximation 
of second order derivates of pt{x). The Newton iteration for the computation of a solution of Eq. ([BT]) 
has the form 

xt(i + 1) = XT{i) - A,pT(x)V,pT(a;)U^£^(,) , (64) 
where IS.xPt{x) is the d^ x dx Hessian matrix of pr(a;). 

Example 3 Again, we use the same example as in Section \4.4\ for illustration and search the maximum 
of the Gaussian filtering pdf pt{x) , T ~ 50, by way of the Newton method. Given an approximation p^{x) 
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Figure 3: Trajectories of the gradient search algorithms (approximate, with k = 9, and exact) on 
the horizontal (upper plot) and vertical (lower plot) planes. 

of the filtering density constructed with the Gaussian kernel (pQ , we run the iterative algorithm 

iT(» + l)" = XT(»)"- A.P^(a;)V,p^(x)|^^.^(^^,, * = 0,1,2,..., (65) 

where both the Hessian matrix and the gradient vector are computed from the approximate density p^{x) 
and the initial condition is xt(O)'^ — (0.4,0.4). This procedure yields a sequence of approximations 
xt(I)'^, . . • ,XT{i)'', ■ ■ ■ of the MAP estimator xt ■ We have also run the Newton iteration over the true 
filtering pdf i.e., the algorithm in (|64[) . that generates the estimates xxi^), ■ ■ ■ ,XTii), ■ ■ ■ for the same 
initial condition. 

The results for the fixed observation sequence yi-x of Section \4-4\ o.'^d Example are shown in 
Figure^ Specifically, Figure^a) shows the trajectory described by the estimates XTi^)'' , ■ ■ ■ , xxii)'' , ■ ■ ■ 
superimposed over the contour plots of the approximate pdf p!^{x) for fc = 7 (and N(k) = k^). Figure 
\^b) depicts the sequence xt{1), ■ . ■ yXxii), - ■ ■ obtained from the search over the true density pt{x) for 
reference. 

Figurel^ displays the time evolution of the estimates xxii)^ and xxii), i = 1,...,500, separately on 
each dimension. In general, the approximation of second order derivates is less accurate than that of 
Pt{x) and its first-order derivates for the same value of k. 

5.2 Functionals of pt 

The result of Theorem |3] allows us to construct (rigorous) approximations of functionals of the form 
{f ° PtTT^t), where o denotes composition and / is a Lipschitz-continuous and bounded real function. In 
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(a) With fc = 7 and N{k) = 117, 649. (b) With the true posterior pdf, pt- 



Figure 4: Trajectories of the Newton algorithms. Plot (a) shows the estimates produced by the 
Newton search algorithm of Eq. (j65[) superimposed over a contour representation oi pi^{x) for k = 7. 
Plot (b) shows the estimates produced by the Newton search algorithm of Eq. superimposed 
over a contour representation of for comparison. 
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Figure 5: Trajectories of the Newton search algorithms (approximate, with k — 7, and exact) on the 
horizontal (upper plot) and vertical (lower plot) planes. 
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order to provide rates for the convergence of the particle-kernel approximations (/ o pj^, tt^'''^^), we again 
work with the sequence of hypercubes Kk ~ [—Mk, Mk] x • • • x [— M^, Alk] C M. " where Mk = ^k-^^p and 
0</?<l,p>3 arc constants with respect to k. Specifically, we have the following result. 

Theorem 7 Choose any bounded, Lipschitz continuous function f, i.e., f G _B(M) andyx,y € M 

\fix)^f{y)\<Cf\x~y\, 

for some finite constant Cf > 0. If pt G -B(IR.'^"=) and TTt{ICj.) < ^k"^ for some constants 7, 6 > then 



{fop1,n^^'^)-{fop,,n,) 



< 



/ 



[i{l-e,7} 



(66) 



where < e < 1 is an arbitrarily small constant and is an a.s. finite random variable independent of 
k. In particular, 

lim (/op,^^f("))-(/°Pt,^t) =0 a.s. 
Proof: Consider first the absolute difference 



|(/°Pt\7^t) - (/°Pt,7rt)| 



< 



[if ° Pt )ix) - if o pt){x)] pt{x)dx 
(/opf)(x) - {f opt){x)\ptix)dx, 



(67) 



where the inequality holds because Ptix) > 0. Using the Lipschitz continuity of / in the integral of Eq. 
(|57)) yields 



|(/oPt',7rt) - (/opt,7rt)| < Cf J \pt{x) ~ ptix)\ptix)dx 



(68) 



where the second inequality follows from the assumption pt G _B(M''^) (hence ||pt||oo < 00). Eq. 
together with Theorem |3] readily yields 

c/lbtllooQ" 



|(/°P^7rt) - (/opt,7rt)| < 



^min{l-e,7} 

where < £ < 1 is a constant and is an a.s. finite random variable. 

As a second step, consider the difference 
that ll/opt lloo < II /II 00 independently of k and an application of Proposition [T] yields 



(69) 



(/op^,^f"^')) - (/op^,7rt) . Since / G B(M), it follows 



E 



(/opf,7rf('^')-(/°Pt,^*) 



< 



cUfWlo < C?|l/||^ 



where q > I and the second inequality holds because N{k) > fc2(d^+|i|+i)_ Using Lemma [T] with 
c = 11/11^ and = (note that q{2dx + 1) > 2 for any q,dx > 1), wc readily obtain the convergence 
rate for the absolute error, i.e.. 



ifoplnr^)-ifopln,) 
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< 



(70) 



where < £ < 1 is an arbitrarily small constant and [/^ > is an a.s. finite random variable. 
To conclude, consider the triangle inequality 



Substituting ^ and dTD]) into ^ yields 

(/op^,^fW)-(/op„7rO 



/ 



(71) 



(72) 



l^l~e ^min{l-e,7} — ^,min{l-e,7} ' 

where the random variable Q^- = + Cf\\pt\\ooQ' > is a.s. finite and independent of k. 

□ 

In statistical signal processing and machine learning applications it is often of interest to evaluate 
the Shannon entropy of a probability measure tt. Assuming that tt has a density p w.r.t. the Lebesgue 
measure, the entropy of the probability distribution is 



■H{n) = -(logp,7r) 



log [p{x)] p{x)dx, 



where S is the support of p. In the case of the filtering measure ttj, it is natural to think of a particle 
approximation of the entropy Hint) constructed as 



N{k) 



Nik) 

Unfortunately, the log function is neither bounded nor Lipschitz continuous and, therefore, Theorem [7] 
does not guarantee the convergence H{TTt)'' — s- H{TTt). Such a result, however, can be obtained, with a 
more specific argument, if we assume the support of the density pt to be compact. 

Theorem 8 Let the sequence of observations Yi-^ ~ yi-.r (for some large but finite T) be fixed and assume 
that gl" £ i?(R''^), (g^^j^t) > 0, 1 <t <T, and the integrability conditions in Proposition[I\ part (b), are 
satisfied. If there exists a compact set S C M'^^ such that J^pt{x)dx = 1 and mixes Pt{x) > 0, then 

lim \n{TTt)'' - nint)] =0 a.s. 



Proof: Wg apply the triangle inequality to obtain 



(-logp^7rf('^')-(-logft,^t) 



< 



/ , fc N(k)^ , , Af(fc) 

(-logK,7ri >) - (-logpt,7ri 
(-logpt,7rf'') - (-logpt,7rt) 



(73) 



and then analyze the two terms in the right-hand side of ((73 
The first one can be expanded to yield 



(-logpf,7rfW)-(-logft,-f^-^) 



< 



Vt{x\ ') 



N{k) 



i=l 
Nik) 



N{k) ^ 



i=l 



lOE 



(74) 
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The logarithm of a ratio x/y can be upper bounded as 



^ < maxja;,;/} _ ^ 
y ~ min{x,?/} 



(75) 



hence applying ([75)) into (|7^ we arrive at 

(-logPt,7rt - (-logPt,7rt 



JV(fe) 

- 7V(fc) ^ 



max 



(76) 



However, from Theorem [5] and Remark [TTl limfc_j.ooPK2;)/pt(2;) = linijv-yoo Pt(-'z;)/Pt (•'J;) = 1 a-S. for every 
X G 5. Moreover, since we have assumed 'vaix^sVii,^) > 0, it follows that for any e > there exists 
independent of x such that, for all k > k^, 



max 



(77) 



Substituting (|77)) into ([75)1 yields, for all k > k^, 



( 1 k N(k)-, , , N{k)-. 



< e a.s. 



lim 

fc— >CX3 



Since e can be as small as we wish, 

lim (-logp^',7rf ^''^) - (-logpt,7rf^''^) =0 a.s. 
The second term in ([75)1 converges to because of Proposition [1] part (b), i.e. 

a.s. 

and taking together Eqs. ([75)). ([7^ and ([75)1 we arrive at 



(-logpt,7rf'') - (-logpt,7rt; 



(78) 



(79) 



lim 



(-logp^^f('=V(-logp^,7r,) 



a.s. 



□ 



Example 4 We continue to use the model of Section \4-4\ to numerically illustrate the particle 
approximation of'H{'Kt). Since the densities pt for this example are Gaussian with known covariance 
matrices Sj, t — 1,2, we can compute their associated Shannon entropies exactly, namely 

H(7r,) = ilog((27re)'*^|S,|), 

where \ T,t \ is the determinant of matrix "Et- Taking t ~ T = 50 and using the same sequence of observations 
?/i:50 as in Section \4.4\ the resulting entropy is 'H{'Kt) = 2.5998 nats. 

Let us point out that, obviously, the Gaussian distribution has an infinite support and, therefore, the 
convergence result of Theorem cannot be rigorously applied. However, the Gaussian pdf is light-tailed 
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k = 3 


/c = 4 


k = 5 


mean 


0.0616 


0.0370 


0.0128 


std 


0.0453 


0.0249 


0.0091 



Table 1: Empirical mean and standard deviation of the entropy-approximation error, 
\T~1-{ttt) — H(7rr)'^|, averaged over 30 independent simulations. The entropies are evaluated in nats. 

and, as can be observed from Figure\^d), it can he truncated within a compact (rectangular) support and 
still yield a faithful representation of the original distribution. 

Table [I] displays the empirical mean and standard deviation of the absolute error 

\'H{TTT)-'H{'KTf\ 

obtained through computer simulations for N[k) = and fc = 3, 4, 5. To he specific, we carried out 30 
independent simulation runs for each value of k. We observe how both the mean error and its standard 
deviation reduce guickly as k is increased. 

6 Summary 

We have addressed the approximation of the sequence of filtering pdf's of a Markov state-space model 
using a particle filter. The numerical technique is conceptually simple. We collect the N particles 
generated by the sequential Monte Carlo algorithm and approximate the desired density as the sum of 
N scaled kernel functions located at the particle positions. The main contribution of the paper is the 
analysis of the convergence of such particle-kernel approximations. In particular, we have first proved 
the point-wise convergence of the approximation of the filtering density and its derivatives as the number 
of particles is increased and the kernel bandwidth is correspondingly decreased. Explicit convergence 
rates are provided and they are sufficient to prove that the approximation errors vanish a.s. Under 
mild additional assumptions on the chosen kernel, it is possible to extend the latter result to prove 
that the approximation error converges uniformly on the support of the filtering density (rather than 
point-wise) and a.s. to 0. We have also found an explicit convergence rate for the supremum of the 
approximation error. The analysis establishes a connection between the complexity of the particle filter 
and the bandwidth of the kernel function used for estimating the filtering pdf. For a fixed number of 
particles N, this relationship yields an optimal value of the bandwidth. 

The uniform approximation result has a number of applications. We have first exploited it to prove 
the convergence, in total variation distance, of the continuous measure generated by the estimated density 
toward the true filtering measure. In a similar vein, we have also shown that the MISE of the sequence 
of approximate densities converges (quadratically with the kernel bandwidth) toward when the state 
space is compact. For a truncated version of the density approximation, the (random) ISE is also shown 
to converge a.s. toward without assuming compactness of the support. Although the convergence rate 
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found for the ISE is only quadratic (versus fourth order for the asymptotic approximation of the MISE in 
classical kernel density estimation theory) , one should be aware that all the results obtained in this paper 
remain valid whenever the density estimator is obtained by smoothing a discrete random measure 
that is "good enough" to estimate integrals of bounded functions in such a way that the Lp norms of the 
approximation error convergence as (in particular, we do not require to have samples from the target 
density pt). As a consequence, the results obtained here can be applied to, e.g., kernel density estimators 
built from importance samples as in |45) , or to the analysis of bootstrapped estimators as considered in 

m 

We have also proved that the maxima of the approximate filtering density converge a.s. toward the 
true ones. Therefore, MAP estimation of the state at time t can be carried out using gradient and 
Newton search methods on the approximate filtering pdf. We remark that it is sound to apply such 
methods on the approximate function, since we have proved convergence also for its derivatives. The last 
application we consider is the approximation of functionals of the filtering pdf. We provide a general result 
that guarantees the convergence of the particle-kernel approximations for general bounded and Lipschitz 
continuous functionals of the filtering density. Finally, we prove that it is also possible to use the proposed 
constructs to approximate the Shannon entropy of densities with a compact support. In order to arrive 
at this result, we have also proved the convergence of the particle filter approximations of integrals of 
unbounded test functions under very mild assumptions (essentially, the integrability of the function up to 
fourth order). This is a departure from most existing approaches, which assume bounded test functions. 

The theoretical results are complemented with some numerical examples, that include the 
approximation of a Gaussian density, its maximization using gradient and Newton methods and the 
approximation of its Shannon entropy. The reason to carry out all examples for a Gaussian pdf is that 
we can compute it exactly using a Kalman filter and, therefore, we can have a proper reference to assess 
the performance of the particle-kernel methods. In the case of the Shannon entropy approximation, 
the Gaussian distribution does not comply with the compact support assumption. However, being a 
light-tailed distribution, our simulations show that accurate estimates can still be computed. 
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A Proof of Proposition [1] 

Part (a) of Proposition [T] is a straightforward consequence of |39[ Lemma 1], hence we focus here on part 
(b). 
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We need to introduce an auxiliary result first. Let {0j;}i=i,...,7v be a collection of zero-mean random 
variables, conditionally i.i.d. given a cr-algebra Q. In particular, £'[0^1^^] — for all i and, for any integrable 
function h, 

E[h{0,)h{0j)\g] = E[hi9,)\g]E[hi9,)\g] 

for any i ^ j. We have the following basic inequality 

Lemma 2 If the zero-mean, i.i.d. variables 9i, i = 1,...,N, are p-integrable then 

N \ P 

N 



E 



i=i 



< -rEr,\g], 

N 2 



(80) 



where c is a constant independent of N . In particular, if p > 2, 

N 



lim 



-Ye, 



= a.s. 



(81) 



Proof: The inequality (|50)) is a straightforward consequence of the Marcinkiewicz-Zygmund inequality 
[55] (see also [211 Theorem 3] or [U Chapter 10]). For p > 2, a standard Borel-Cantelli argument yields 
Eq. dHU). □ 

Lemma m enables us to prove the convergence of particle approximations (/, tt^) toward (./, TTt) when 
the function / is integrable w.r.t. ttj but not necessarily bounded. We prove this result by induction in t. 



For t = 0, the samples x 



1, N, are i.i.d. with common distribution ttq, hence the approximation 



ttq is constructed as ttq (dx) = X]i=i ^^i^iidx). It is, therefore, apparent that 

i?[(/,^„^)-(/,^o)] -0 
and a simple application of Lemma [2] (with 6i ~ fix^'^) — (/, ttq)) yields 



lim K/i'^o') - (/:7ro)| = a.s. 



N 



Next, we consider the induction hypothesis 



lim |(/,O-(/,7r0| =0 a.s. 



(82) 



and study the particle approximations at time t + 1. 

We start with the convergence of integrals of the form (/, ^t+i), where ^t+i = Tt+iTTj, as defined in 
Eq. ([7]). The particle approximation of has the form 

N 

N 



1 ^ 

cf(dx)=-5:^^,„,(dx), 



hence the integral (/, Ct+i) has conditional mean 



£^[(/,ef+i)l-^*]-(/,n+i^f), 
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given the cr-algebra Tt = cr ( xj"'' , xi""* : I < n < N, 0<Z<i, 0<s<i).ln particular, the random 



variables 6i = /(x^^j^) — (/, tj+itt^) are centered and i.i.d., and Lemma [5] yields 



E 



(/,^ili)-(/,Tt+i7rf)n^, 



N\\P 



= E 



N 



N 



1^* 



< 



~ctE[e1\Tt 



(83) 



where Ct is a constant w.r.t. N. Since both / and ft+i are p-integrable, it follows that ElO^lTt] is a finite 
and non-negative random variable. Taking unconditional expectations on both sides of (|83p . we arrive at 



E 



(/,Cili)-(/,rt+i7r, 



N\P 



< 



Ct 



(84) 



where Ct — CtE [E[9^\J^t]] is a constant w.r.t. N. From ([M)) it is straightforward to prove (e.g., using 
Lemma [T] or a Borel-Cantelli argument with p > 2) that 



hm |(/,Ci^i)-(/,n+i^r)| =0 a.s 



(85) 



Moreover, given the function ft+i{x) = J f{x')Tt+i{dx'\x), we can write (/, Tt+i7r^) — (/t+i,7r^) and 
(/, Tt+i,7rt) = (/t+i,7rt) and, as a consequence, 

|(/,n+i7rf ) - {f,Tt+iTTt)\ = |(/t+i,7rf ) - (/t+i,7rt)| . 

Note that /t+i is a real hmction on M"^^ and we have assumed that (/f+i, tt^) < oo. Therefore, for p > 2, 
we can apply the induction hypothesis ()82[) to arrive at 



lim |(/t+i,7rf ) - (/t+i,7rt)| = a.s. 



N 



(86) 



It is now straightforward to prove the convergence of {f,£,i^i). Specifically, from the triangle inequality 
we obtain 



lim (/,Cr+i) - (/,6+i) < lim (/-^t+i) - + lim (/,r,+i7rf ) - (/,6+i) (87) 

N-^oo N^oo TV— s-oo 



and, since ^i+i — Tt^iTTt- substituting ([85j) and ([86jl into f|87|l yields 

lim |(/,?t^i)-(/,6+i)| =0 a.s. 
In order to analyze the approximations (/, vr^^)^ '^'^ ^^^^^ consider the measure 



(88) 



N 



TTf^i (da:) = ^ wj"; (5_(„)^ (da:) , where 



(n) _ t/f+1 l-^f+l/ 



obtained immediately after the computation of the weights in the particle filter. Integrals w.r.t. tt^^i can 
be written in terms of g^^i and namely 

/ r -N \ _ U yt+i 1 st+i7 
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This is natural, though, since from the Bayes theorem we readily derive the same relationship between 
Ht+i and ^t+i. 



(/5f;r,6+i) 



(ffrir,6+i) ' 

and the difference (/, tt^j^) — (/, TTt+i) can, therefore, be written as 



(/,7f,^i)-(/,7r,+i) 



(5t+i\6+i) 

UaT+i ^^t+i) {aT+i ; 6+1 ) - (5t+i 1 Ct+i ) 

X 



r .yt+i \ 
li/t+i ist+i; 



(fft+iS6+i) 



Note that, since 5^;^ ^ ^(1^'^''), Eq. dH]) yields 



lim {glX\' , 6+1 ) - (fft+i , ^f+i ) = a.s. 
and, since the assumptions in the Lemma impose that 

(ffr;r,6+i)>o, 

it follows that 



N 



lim (5t+i,^t^i) > a.s. 



(89) 



(90) 



(91) 



(92) 



Also as a consequence of G i?(]R'^'=), the function fgi^i is p-integrable and Eq. guarantees, for 
p> 2, that 

oW'+i c. .\ _ t . . cN \\ _ n 00 (93) 



lim , 6+1) ~ (/5t+i, ef+i) = a.s. 

A* — ^00 



and, in particular 



N 



lim (/fft+ij^t^i) < 00 a.s. 



(94) 



Taking Eqs. (|9T|) and ((92|) together, we deduce that the denominator of (|89|) is a.s. positive (as — ^ 00). 
Similarly, Eqs. ([TO]) . (|93p and ([M)) show that the numerator of ([55]) converges to a.s. as TV — 00. 
Therefore, 



lim (/,7rt+i) - (/,7rt+i) a.s. 



(95) 



On the other hand, the difference |(/, tt^j^) — (/, 7rt+i)| also converges to a.s. Indeed, given the 



tr-algcbra J^t — cr [xi- ,x. 



J") =("). 



n = l,...,Af, 0<r<t, 0<s<t 



,iV, 



hence the random variables 6i = f(xl^■^^) — (/, """t+i) are centered and conditionally i.i.d. given J-t+i- As 
a consequence. Lemma [2] yields 



E 



1 ^ 
A ^ 



< 



AT* 



(96) 
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where ct+i is a constant and ~ (/' ^t+i) ^ (/; ^t+i)- If take full expectations on both sides 

of ([M|) . wc obtain 



(/,<i)-(/,^f+i)ri-^*+i (97) 

where Ct+i = Ct+iE [E[6^\^t+i]] is a finite constant. For p > 2, from Eq. (IWl) . we readily deduce that 

lim |(/,7r,^i)-(/,7f,^i)| =0 a.s., (98) 

using either Lemma [1] or a standard Borel-Cantelli argument. 
Finally, taking together ([55)) . ([M)) and the triangle inequality 

|(/,<i) - (/,^,+i)| < |(/,<i) - (/,^t+i)| + la^ri^i) - , 
we arrive at limAr^oo |(/, t^^i) - (/, 7rt+i)| = a.s. 

B Proof of Lemma [1] 

Choose a constant f3 such that v < /3 < p — 1 and define the random variable 

C30 

The variable U^''^ is obviously non- negative and, additionally, it has a finite mean, E[U^'^] < oo. Indeed, 
from Fatou's lemma 

oo oo 
m=l m=l 

where the second inequality follows from Eq. (|2Qp . Since (3-1^ >0, it follows that Y.^^=i m-^''^^-"^ < oo, 
hence Ep^'P] < oo. 

We use the so-defined random variable U^'P in order to determine the convergence rate of 9'^. 
Obviously, kP-^-f'{e'')P < U^'P and solving for 0^ yields 

nk^{UP'Py 



k p 



If we define e = and ~ [U^'P) ^ , the we obtain the inequality 

Since E[U^'P] < oo, it follows that E[{U^)p)] < oo, hence is a.s. finite. Also, we recall that 
ly < (3 < p - 1, therefore < e <1. 
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